
Add replicate-weight variance and PSU-level bootstrap to dCDH#311

Merged
igerber merged 13 commits into main from feature/dcdh-survey-variance-extensions
Apr 18, 2026

Conversation

@igerber
Owner

@igerber igerber commented Apr 18, 2026

Summary

  • Adds opt-in replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) for dCDH via the library's compute_replicate_if_variance helper, routed through the inline branch in _survey_se_from_group_if (Class A IF sites) and a parallel branch in _compute_heterogeneity_test (Class B). Effective df_survey reduces to min(n_valid) - 1 across IF sites when replicates fail, matching the precedent in efficient_did.py and triple_diff.py.
  • Adds PSU-level Hall-Mammen wild bootstrap via a group_id_to_psu_code dict threaded through _compute_dcdh_bootstrap. Every target's PSU map is now constructed by group-ID lookup (not positional reuse), so future horizon-specific masking cannot silently miscluster bootstrap draws — a size mismatch raises loudly. Under the default auto-inject psu=group, the identity-map fast path preserves pre-change behavior bit-for-bit.
  • Locks replicate_weights + n_bootstrap > 0 as mutually exclusive (replicate variance is closed-form; combining would double-count), matching efficient_did.py:989, staggered.py:1869, two_stage.py:251-253.
  • Warning only fires when PSU is strictly coarser than group (n_psu_eff < n_groups_eff); auto-inject psu=group no longer triggers the stale "PSU-level deferred" warning.
  • Updates REGISTRY.md lines 651-653 to document both mechanisms; twowayfeweights() accepts replicate designs (diagnostic numbers match fit(..., survey_design=sd).twfe_*).
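The mutual-exclusion contract above (replicate variance is closed-form, so layering a bootstrap on top would double-count sampling variability) can be sketched as below. The function name and signature are illustrative, not the library's actual API:

```python
def check_variance_contract(has_replicate_weights: bool, n_bootstrap: int) -> None:
    """Reject replicate variance combined with bootstrap variance."""
    if has_replicate_weights and n_bootstrap > 0:
        raise NotImplementedError(
            "replicate_weights and n_bootstrap > 0 are mutually exclusive: "
            "replicate variance is closed-form; a bootstrap would double-count"
        )

check_variance_contract(True, 0)     # replicate-only fit: OK
check_variance_contract(False, 500)  # bootstrap-only fit: OK
```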

Methodology references (required if estimator / math changes)

  • Method names: de Chaisemartin-D'Haultfœuille (dCDH) survey-design analytical variance (library extension) and Hall-Mammen wild PSU bootstrap (library extension).
  • Paper / source links:
    • de Chaisemartin & D'Haultfœuille (2020) AER + dynamic companion paper — dCDH estimator.
    • Binder (1983) — TSL stratified-PSU variance formula (used by compute_survey_if_variance).
    • Rao-Wu reweighting convention — basis for compute_replicate_if_variance.
    • Hall & Mammen multiplier-bootstrap-at-PSU — basis for the opt-in bootstrap path.
  • Intentional deviations from the source: the dCDH papers assume iid sampling and specify only analytical cohort-recentered plug-in variance; both the group-level IF survey expansion and PSU-level Hall-Mammen bootstrap are documented library extensions in docs/methodology/REGISTRY.md (ChaisemartinDHaultfoeuille Notes on survey expansion + survey+bootstrap contract). R DIDmultiplegtDYN does not natively support either mechanism.
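As a toy illustration of the Rao-Wu-style replicate-combination idea: re-estimate the target under each replicate weight column and combine squared deviations from the full-sample estimate with a method-specific coefficient. The JK1 coefficient (R - 1)/R is shown; BRR, Fay, JKn, and SDR each use their own coefficient, and the helper name here is an assumption for illustration only:

```python
import numpy as np

def jk1_replicate_variance(theta_full: float, theta_reps: np.ndarray) -> float:
    """JK1 combination: (R - 1)/R * sum_r (theta_r - theta_full)^2."""
    R = theta_reps.size
    return (R - 1) / R * float(np.sum((theta_reps - theta_full) ** 2))

reps = np.array([1.0, 1.2, 0.9, 1.1])   # point estimate under each replicate
var = jk1_replicate_variance(reps.mean(), reps)
assert abs(var - 0.0375) < 1e-12        # (3/4) * 0.05
```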

Validation

  • Tests added/updated: tests/test_survey_dcdh_replicate_psu.py (new file, ~40 tests across 4 classes — TestReplicateClassA, TestReplicateClassB, TestPSUBootstrap, TestInvariants). tests/test_survey_dcdh.py updated to reflect the new contract (replicate-only fits now succeed; replicate + n_bootstrap > 0 rejection tested; auto-inject no-warning invariant tested).
  • Backtest / simulation / notebook evidence: N/A — analytical replicate variance + Hall-Mammen wild PSU bootstrap are validated via the unit/integration tests in test_survey_dcdh_replicate_psu.py, including convergence-to-TSL parity for JK1 at n=5000 and end-to-end shared-draw bootstrap under a strictly coarser PSU with L_max=2.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber added 4 commits April 18, 2026 08:27
Extends the group-level Taylor Series Linearization survey support
landed in PR #307 with two opt-in variance mechanisms:

1. Replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) routed
   through the unified compute_replicate_if_variance helper at every
   IF site (overall DID_M, joiners DID_+, leavers DID_-, multi-horizon
   DID_l, placebo DID^pl_l, heterogeneity beta^het_l, twowayfeweights
   helper). Uses inline-branch pattern mirroring EfficientDiD
   (efficient_did.py:1119-1142) and TripleDifference
   (triple_diff.py:1206-1238). Effective df_survey reduces to
   min(n_valid) - 1 across IF sites when replicates fail, matching
   the precedent in efficient_did.py:1133-1135 and
   triple_diff.py:676-686.

2. PSU-level Hall-Mammen wild bootstrap via group_to_psu_map threaded
   through _compute_dcdh_bootstrap. Under auto-inject psu=group the
   identity-map fast path preserves pre-change behavior bit-for-bit.
   Under an explicitly coarser PSU (e.g., psu=state with county-level
   groups), all groups in the same PSU receive the same bootstrap
   multiplier so within-PSU correlation is preserved.

Locked contracts:
- Replicate weights and n_bootstrap > 0 are mutually exclusive
  (replicate variance is closed-form; bootstrap would double-count).
  Raises NotImplementedError, matching efficient_did.py:989 etc.
- Strata participate only through analytical TSL variance, not the
  bootstrap (multiplier weights stay unit-by-unit).
- Warning fires only when n_psu_eff < n_groups_eff (strictly coarser
  PSU). Under auto-inject psu=group the warning is correctly
  suppressed.
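The PSU-level broadcast contract above (all groups in the same PSU get the same bootstrap multiplier) can be sketched with Mammen's two-point distribution, which has mean 0 and variance 1. The function and the exact parametrization are illustrative assumptions, not the library's implementation:

```python
import numpy as np

def psu_mammen_weights(group_psu_codes: np.ndarray,
                       rng: np.random.Generator) -> np.ndarray:
    """One Mammen two-point multiplier per PSU, broadcast to its groups."""
    phi = (1.0 + np.sqrt(5.0)) / 2.0   # golden ratio
    lo, hi = 1.0 - phi, phi            # two-point support: mean 0, variance 1
    p_lo = phi / np.sqrt(5.0)          # P(w = 1 - phi), ~0.7236
    psus, inverse = np.unique(group_psu_codes, return_inverse=True)
    draws = np.where(rng.random(psus.size) < p_lo, lo, hi)
    return draws[inverse]              # broadcast each PSU draw to its groups

rng = np.random.default_rng(0)
codes = np.array([0, 0, 1, 1, 2])      # groups 0-1 share a PSU, 2-3 share another
w = psu_mammen_weights(codes, rng)
assert w[0] == w[1] and w[2] == w[3]   # within-PSU correlation preserved
```

Under auto-inject psu=group, `codes` is a permutation of unique values, so every group draws independently and the path reduces to the plain group-level wild bootstrap.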

Adds 36 regression tests in tests/test_survey_dcdh_replicate_psu.py
covering Class A sites (via _survey_se_from_group_if), Class B sites
(heterogeneity + twowayfeweights), PSU bootstrap semantics, and
cross-cutting invariants (replicate+bootstrap rejection, HonestDiD
under replicate). REGISTRY.md lines 651-653 updated to document both
mechanisms.

Full suite: 525 passing (was 489 before this change).
Addresses local AI review findings:

1. P1: `_slice_psu_map` now enforces strict length equality between
   `target_size` and the full PSU map. Under the current bootstrap
   architecture all targets (overall DID_M, joiners DID_+,
   leavers DID_-, multi-horizon DID_l, placebo DID^pl_l) use the
   same variance-eligible group ordering from `_eligible_group_ids`,
   so the old prefix-slicing was a silent no-op. Making the check
   strict converts a fragile unstated invariant into a loud
   `ValueError` if a future refactor introduces a target whose group
   subset differs from the overall ordering (e.g., an aggregated
   target built from a non-prefix mask) — preventing a silent
   miscluster bug.
2. P2: `fit()` `survey_design` docstring updated to describe:
   - Replicate-weight variance support (BRR/Fay/JK1/JKn/SDR).
   - PSU-level Hall-Mammen wild bootstrap contract.
   - Replicate + `n_bootstrap > 0` rejection.
3. P2: `twowayfeweights()` docstring updated to document replicate
   acceptance.
4. P2: Added 3 regression tests in
   `tests/test_survey_dcdh_replicate_psu.py`:
   - `test_psu_map_size_mismatch_raises`: exercises the new strict
     equality check.
   - `test_generate_psu_or_group_weights_broadcast`: direct unit test
     of PSU-level broadcast — groups in the same PSU receive the same
     multiplier within a bootstrap replicate.
   - `test_generate_psu_or_group_weights_identity`: verifies the
     identity-map fast path is bit-identical to the non-PSU path.

All 321 dCDH-related tests still pass.
Addresses the AI review R2 finding that the prior shared-map +
length-assertion approach still relied on an unstated invariant
(all bootstrap targets use the variance-eligible group ordering).

Changes:
- `_compute_dcdh_bootstrap` now accepts `group_id_to_psu_code: Dict[Any, int]`
  and `eligible_group_ids: np.ndarray` instead of a pre-built flat
  `group_to_psu_map` array. Every target call constructs its PSU
  map via `_map_for_target(target_size, group_id_to_psu_code,
  eligible_group_ids)` — a dict lookup per group ID, not a positional
  reuse. The length-assertion is retained for the size-mismatch
  case and now sits alongside a KeyError-based check for missing
  group IDs.
- `_slice_psu_map` removed; `_map_for_target` replaces it. The
  explicit per-target construction removes the hidden semantic
  assumption that was previously enforced only by cardinality.
- `fit()` builds the dict from `_eligible_group_ids` + the dense PSU
  codes. The per-target maps are all identical today (because every
  current target uses `_eligible_group_ids` ordering), but that is
  now a consequence of the caller's choice rather than a brittle
  assumption buried in the mixin.
- Test `test_psu_map_size_mismatch_raises` replaced by
  `test_map_for_target_id_lookup`, which exercises the ID-lookup
  semantics including:
  - Correct map construction for a given eligible-group ordering.
  - Non-prefix reordering produces a different map (proving the
    ordering-from-IDs contract).
  - Length mismatch → ValueError.
  - Missing-group → ValueError.
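The ID-lookup semantics described above can be sketched as follows. `map_for_target` mirrors the commit's description (strict size check, per-ID dict lookup, loud failure on a missing group); the exact signature is an assumption for illustration:

```python
import numpy as np

def map_for_target(target_size: int,
                   group_id_to_psu_code: dict,
                   eligible_group_ids: np.ndarray) -> np.ndarray:
    """Construct a target's dense PSU map by group-ID lookup, not position."""
    if target_size != eligible_group_ids.size:
        raise ValueError(
            f"PSU map size mismatch: target expects {target_size} groups "
            f"but eligible ordering has {eligible_group_ids.size}"
        )
    try:
        return np.array([group_id_to_psu_code[g]
                         for g in eligible_group_ids.tolist()])
    except KeyError as missing:
        raise ValueError(f"group ID {missing} has no PSU code") from missing

codes = {10: 0, 11: 0, 12: 1}          # e.g., two county groups in one state PSU
ids = np.array([10, 11, 12])
assert map_for_target(3, codes, ids).tolist() == [0, 0, 1]
```

Because the map is rebuilt per target from IDs, reordering `eligible_group_ids` reorders the map with it, which is exactly the ordering-from-IDs contract the replacement test exercises.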

Full regression: 401 passing (previously 321).
AI review R3 flagged the remaining positional truncation in the
multi-horizon shared-draw path:

    w_h = shared_weights[:, : u_h.size]

Under the current contract every horizon's IF vector uses the full
variance-eligible group ordering, so `u_h.size == n_groups_mh` always
holds and the slice is a no-op. But the truncation itself is a hidden
positional assumption — if a future refactor introduced horizon-
specific masking, the shared-weight alignment would break silently.

Replace the truncation with:
1. An explicit assertion `u_h.size == n_groups_mh` that raises with a
   message pointing future refactors at the right fix (threading
   target-specific group IDs through `multi_horizon_inputs`).
2. Direct reuse of the full `shared_weights` matrix — removes the
   slice semantic entirely.
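The two steps above amount to replacing a silent slice with a loud alignment check; a minimal sketch (names mirror the commit message, signature is illustrative):

```python
import numpy as np

def aligned_shared_weights(shared_weights: np.ndarray, u_h: np.ndarray,
                           n_groups_mh: int) -> np.ndarray:
    # Replaces the silent positional truncation
    #     w_h = shared_weights[:, : u_h.size]
    # Under the current contract every horizon's IF vector spans the full
    # variance-eligible group ordering, so a mismatch means a refactor
    # broke the shared-draw alignment and must fail loudly.
    if u_h.size != n_groups_mh:
        raise AssertionError(
            f"horizon IF length {u_h.size} != n_groups_mh {n_groups_mh}; "
            "thread target-specific group IDs through multi_horizon_inputs"
        )
    return shared_weights  # reuse the full matrix; no slice semantics

shared = np.ones((5, 3))   # 5 bootstrap draws x 3 eligible groups
u = np.zeros(3)            # one horizon's IF vector
assert aligned_shared_weights(shared, u, 3) is shared
```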

Add `TestPSUBootstrap.test_multi_horizon_shared_draw_under_coarser_psu`
covering an end-to-end L_max=2 + coarser PSU case that exercises the
shared-draw path + assertion.

Full regression: 312 passing.
@github-actions

Overall Assessment

⚠️ Needs changes

Highest unmitigated severity: P1 on replicate-survey df_survey handling for dCDH inference.

Executive Summary

  • I cross-checked the changed methodology against the dCDH registry entry and the survey-design contracts. The new replicate-weight IF expansion and PSU-level Hall-Mammen bootstrap are documented library extensions, so those deviations from the papers/R are not defects by themselves: docs/methodology/REGISTRY.md#L651.
  • P1: dCDH’s new _effective_df_survey() helper overwrites the design-level replicate df with min(n_valid)-1 instead of reducing the design df when replicate solves fail. That can make top-level dCDH inference use a df that is too large on rank-deficient replicate designs, and the final value is never written back to results.survey_metadata, so downstream HonestDiD can disagree with the main dCDH surface.
  • The new tests only cover clean/full-rank replicate designs (df = R - 1), so they miss both rank-deficient replicate matrices and dropped-replicate paths.
  • I could not execute the test suite in this sandbox because Python test deps are unavailable (pytest and numpy are missing), so this is a static review.

Methodology

Code Quality

No additional findings beyond the P1 df propagation bug above.

Performance

No findings in the changed code.

Maintainability

No separate maintainability finding. Fixing the authoritative df_survey path will remove the main cross-surface inconsistency.

Tech Debt

No existing TODO entry mitigates the P1 above. The dCDH TODO items at TODO.md#L57 cover placebo SE and parity-test scope, not replicate-df propagation.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: The new tests only assert df_survey == R - 1 on clean replicate designs at tests/test_survey_dcdh_replicate_psu.py#L190 and tests/test_survey_dcdh_replicate_psu.py#L687. They do not exercise either of the two cases the new code is supposed to handle specially: rank-deficient replicate matrices (resolved.df_survey < R-1) or site-level dropped replicates (n_valid < resolved.df_survey + 1). Concrete fix: add one regression with duplicated replicate columns (rank-deficient design) and one with an invalid/dropped replicate column; assert that top-level dCDH inference, results.survey_metadata.df_survey, and HonestDiD all use the same reduced df.
  • Severity P3. Impact: _compute_heterogeneity_test()’s docstring still describes the survey path as Binder TSL only at diff_diff/chaisemartin_dhaultfoeuille.py#L3438, but the implementation now dispatches to replicate variance too at diff_diff/chaisemartin_dhaultfoeuille.py#L3634. Concrete fix: update the docstring to mention replicate-weight variance and the effective-df rule.

Path to Approval

  1. Change _effective_df_survey() so it never exceeds the base replicate design df from ResolvedSurveyDesign.df_survey; it should only reduce that df when n_valid is smaller.
  2. Write the final effective df back into survey_metadata.df_survey before creating ChaisemartinDHaultfoeuilleResults, so honest_did=True, compute_honest_did(results), and exported metadata all consume the same df.
  3. Add regression tests for duplicated replicate columns and dropped/invalid replicate columns, and assert consistent df usage across the main dCDH surface and HonestDiD.

@igerber igerber added and then removed the ready-for-ci (Triggers CI test workflows) label Apr 18, 2026
…ance

Addresses R1 P1 from PR #311 AI review.

The R1 reviewer flagged that `_effective_df_survey()` was overwriting
the design-level `resolved_survey.df_survey` with `min(n_valid) - 1`
whenever any replicate-aware IF site ran. For replicate designs with
rank deficiency, `ResolvedSurveyDesign.df_survey` returns
`QR-rank - 1` (per R's `survey::degf()` convention at
`diff_diff/survey.py:590`), which can be strictly smaller than
`R - 1`. The old behavior produced anti-conservative t-critical
values on the top-level dCDH surface.

Additionally, the final effective df was never persisted into
`survey_metadata.df_survey`, so HonestDiD (which reads from
`results.survey_metadata.df_survey` at `honest_did.py:973`) could
use a different df than the main dCDH surface.

Changes:
- `_effective_df_survey()` now returns `min(resolved_survey.df_survey,
  min(n_valid) - 1)` when both are defined; returns None if base df
  is undefined (rank ≤ 1) or reduced df is degenerate (< 1).
- In `fit()`, immediately before constructing
  `ChaisemartinDHaultfoeuilleResults`, mutate
  `survey_metadata.df_survey` to the final effective df (post-all-
  IF-sites, including heterogeneity). `SurveyMetadata` is a mutable
  `@dataclass` so direct assignment is safe.
- `_compute_heterogeneity_test()` docstring updated to describe the
  replicate-weight dispatch and effective-df rule (R1 P3).

Regression tests in `TestInvariants`:
- `test_rank_deficient_replicate_uses_design_df`: duplicated replicate
  columns → `design_df < R - 1` → persisted `survey_metadata.df_survey`
  must equal `design_df` (not `R - 1`).
- `test_dropped_replicate_reduces_df`: zeroed replicate columns
  produce rank deficiency in the design → persisted df must be
  bounded by `design_df`.
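The capped-df rule described above can be sketched as below; the helper name and return convention mirror the commit message but are illustrative, not the library's exact implementation:

```python
def effective_df_survey(design_df, n_valid_list):
    """Capped effective df: min(design_df, min(n_valid) - 1).

    Returns None when the base design df is undefined (QR rank <= 1)
    or when the reduced df is degenerate (< 1).
    """
    if design_df is None:
        return None               # undefined base df stays undefined
    if not n_valid_list:
        return design_df
    df = min(design_df, min(n_valid_list) - 1)
    return df if df >= 1 else None

assert effective_df_survey(79, [50, 40]) == 39   # site n_valid binds
assert effective_df_survey(12, [50]) == 12       # rank-deficient design df caps
assert effective_df_survey(None, [50]) is None   # rank <= 1: no rescue
assert effective_df_survey(10, [1]) is None      # degenerate reduced df
```

The old behavior corresponded to returning `min(n_valid_list) - 1` unconditionally, which on a rank-deficient design exceeds the design df and yields anti-conservative t-critical values.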

Full regression: 324 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 13d5e41f646d39ef2ad9b6c58c3a9542977f8ef9


Overall Assessment
⛔ Blocker

Highest unmitigated severity: P0 on replicate-survey inference when the base replicate design has undefined df_survey (QR rank <= 1).

Executive Summary

  • The prior re-review finding is only partially fixed: _effective_df_survey() now caps by the design df and the final value is written back to survey_metadata, so the original top-level rank-deficient df_survey bug is improved on the main path. See diff_diff/chaisemartin_dhaultfoeuille.py:L5009-L5040 and diff_diff/chaisemartin_dhaultfoeuille.py:L2697-L2700.
  • P0: dCDH still violates the replicate-df contract when the replicate design’s base df is undefined. _effective_df_survey() returns None, which safe_inference() treats as z-inference, and the new heterogeneity branch can replace an undefined base df with n_valid_het - 1, yielding misleading non-NaN p-values/CIs where the registry requires NaN inference.
  • P1: the new heterogeneity integration still bypasses the base design df and can lower survey_metadata.df_survey only after overall_* / event_study_effects inference has already been computed, leaving inconsistent df usage across result surfaces when heterogeneity has the smallest n_valid.
  • The replicate-weight IF variance and PSU-level Hall-Mammen bootstrap themselves are documented library extensions in the registry, so those changes are not methodology defects. See docs/methodology/REGISTRY.md:L651-L653.
  • I could not execute the tests in this sandbox because pytest and numpy are unavailable.

Methodology

  • Severity P0. Impact: the PR newly makes replicate-weight dCDH reachable, but the undefined-df contract is still wrong. The survey registry says replicate df_survey is QR-rank - 1 and is undefined when rank <= 1, which should yield NaN inference, not z-inference. See docs/methodology/REGISTRY.md:L2665-L2670. In dCDH, _effective_df_survey() returns None for that case diff_diff/chaisemartin_dhaultfoeuille.py:L5009-L5040, while safe_inference() treats df=None as normal/z inference and only returns NaN when df <= 0 diff_diff/utils.py:L152-L186. The main dCDH surfaces pass that None straight through, e.g. diff_diff/chaisemartin_dhaultfoeuille.py:L1864-L1867. The new heterogeneity path is even less safe: it starts from resolved.df_survey, but if that is None it explicitly falls back to n_valid_het - 1 diff_diff/chaisemartin_dhaultfoeuille.py:L3499-L3509 and diff_diff/chaisemartin_dhaultfoeuille.py:L3670-L3689. That can silently produce valid-looking p-values/CIs where the registry requires NaN. Concrete fix: keep exported survey_metadata.df_survey=None if that is the project convention, but introduce an internal dCDH inference-df helper that maps “replicate design + undefined effective df” to 0 before every safe_inference() call; in _compute_heterogeneity_test(), never promote an undefined base df to n_valid_het - 1. Add a regression with a rank-1 replicate matrix that still yields finite replicate SEs and assert NaN t_stat/p_value/conf_int on overall, event-study, and heterogeneity outputs.
  • Severity P1. Impact: the previous top-level fix is not fully carried through the new heterogeneity integration. _compute_heterogeneity_test() re-derives df_s as min(replicate_n_valid_list) - 1 instead of reusing the base-capped _effective_df_survey() diff_diff/chaisemartin_dhaultfoeuille.py:L3499-L3509, so rank-deficient replicate designs can still get a heterogeneity df that exceeds the design df. Separately, the main dCDH inference surfaces are computed before heterogeneity runs, e.g. diff_diff/chaisemartin_dhaultfoeuille.py:L1628-L1630 and diff_diff/chaisemartin_dhaultfoeuille.py:L1864-L1903, but heterogeneity appends to the shared replicate tracker only later diff_diff/chaisemartin_dhaultfoeuille.py:L2620-L2633, after which survey_metadata.df_survey is overwritten with the smaller final value diff_diff/chaisemartin_dhaultfoeuille.py:L2697-L2700. HonestDiD then reads that final metadata df diff_diff/honest_did.py:L970-L978. So when heterogeneity has the smallest n_valid, overall_* / event_study_effects can be anti-conservative relative to heterogeneity_effects, survey_metadata.df_survey, and HonestDiD. Concrete fix: initialize heterogeneity’s starting df from _effective_df_survey(resolved, replicate_n_valid_list) instead of raw min(n_valid)-1, and finalize the shared replicate df before public inference is frozen, or recompute the already-built dCDH inference fields after heterogeneity updates the tracker.
  • Severity P3. Impact: no defect here. The replicate-weight IF expansion and PSU-level Hall-Mammen wild bootstrap are documented deviations from the dCDH papers / R implementation, so I am not treating those changes as source-material mismatches. See docs/methodology/REGISTRY.md:L651-L653. Concrete fix: none.

Code Quality
No additional findings beyond the duplicated / inconsistent df logic already captured above.

Performance
No findings.

Maintainability
No separate maintainability finding beyond the copied df logic in _compute_heterogeneity_test(), which is already the root cause of the P1.

Tech Debt
No separate finding, but the blocking issues above are not mitigated by existing TODO tracking. The dCDH TODO entries currently cover other items only. See TODO.md:L51-L60.

Security
No findings.

Documentation/Tests

  • Severity P2. Impact: the new test coverage does not exercise either blocker above. The heterogeneity replicate test only checks finite beta/se tests/test_survey_dcdh_replicate_psu.py:L319-L337, and the new df regressions cover top-level metadata / HonestDiD but do not activate heterogeneity= or the undefined-df (rank <= 1) case tests/test_survey_dcdh_replicate_psu.py:L715-L800. Concrete fix: add one rank-1 replicate design that still produces finite SEs and assert NaN inference on all public dCDH surfaces, plus one rank-deficient replicate design with heterogeneity= asserting consistent df usage across overall_*, event_study_effects, heterogeneity_effects, survey_metadata.df_survey, and HonestDiD.
  • Severity P3. Impact: the detailed Phase 3 heterogeneity registry note still describes only TSL survey inference docs/methodology/REGISTRY.md:L618-L618, while the implementation and the top-level dCDH survey note now add a replicate-weight branch diff_diff/chaisemartin_dhaultfoeuille.py:L3454-L3471 and docs/methodology/REGISTRY.md:L651-L653. Concrete fix: update the detailed heterogeneity note so the registry has one consistent survey-inference contract.
  • I could not run the added tests here because the sandbox lacks pytest and numpy.

Path to Approval

  1. Add a dCDH-local inference-df helper that converts “replicate design with undefined effective df” into df=0 for safe_inference(), and use it at every dCDH inference callsite.
  2. In _compute_heterogeneity_test(), stop deriving df_s from raw min(replicate_n_valid_list) - 1; use the same base-capped helper as the rest of dCDH, and never replace an undefined base df with n_valid_het - 1.
  3. Make all public dCDH inference surfaces use one final shared replicate df when heterogeneity= is active: either run heterogeneity before the other safe_inference() calls or recompute the already-built public inference fields afterward.
  4. Add regression tests for a rank-1 replicate design (finite SE, NaN inference required) and for a rank-deficient design with heterogeneity= asserting df consistency across results and HonestDiD.

…stency

Addresses R2 P0 + P1 from PR #311 AI review.

R2 P0: `_effective_df_survey` returns None when replicate design has
undefined base df (QR-rank ≤ 1), but `safe_inference(df=None)` uses
z-inference — NOT NaN. So a rank-1 replicate design with finite SE
silently produced valid-looking p-values/CIs where REGISTRY.md
mandates NaN.

R2 P1: `_compute_heterogeneity_test` re-derived its local df_s from
`min(replicate_n_valid_list) - 1` directly — bypassing the base-df
cap the rest of dCDH uses. When df_s was None, the in-loop fallback
promoted it to `n_valid_het - 1`, which rescued a rank-deficient
design in heterogeneity inference that the main surface correctly
treated as NaN. Separately, heterogeneity appended its own n_valid
AFTER main inference had frozen, leaving `overall_t`,
`event_study_effects`, etc. computed with a larger-than-final df while HonestDiD
(reading the post-heterogeneity `survey_metadata.df_survey`) used the
smaller final df.

Changes:
- Add `_inference_df(effective_df, resolved_survey)` helper next to
  `_effective_df_survey`: coerces None to 0 when replicate (triggers
  safe_inference's NaN branch), else returns effective_df as-is.
- Wrap every dCDH `safe_inference(df=...)` call site (~14 sites) with
  `_inference_df(df, resolved_survey)`.
- In `_compute_heterogeneity_test`, derive `df_s` via
  `_effective_df_survey(resolved, list(replicate_n_valid_list or []))`
  instead of the ad-hoc `min - 1`. When `df_s is None` (undefined base
  df), keep `df_s_local = None` — never promote.
- Add a post-heterogeneity finalization block in `fit()` that re-runs
  `safe_inference` for every public surface (overall, joiners,
  leavers, multi-horizon, placebo-horizon, heterogeneity) with the
  FINAL effective df. This guarantees every surface and
  `survey_metadata.df_survey` use the same df.
- Update REGISTRY.md heterogeneity Note to document replicate-weight
  dispatch and the effective-df rule.

Regression tests (TestInvariants):
- `test_rank_1_replicate_forces_nan_inference`: identical replicate
  columns → rank-1 design → design df None → all inference fields NaN.
- `test_heterogeneity_replicate_cross_surface_df_consistency`:
  rank-deficient replicate design with `heterogeneity=` active —
  asserts `overall_p_value` matches `t(final_df)` distribution and
  heterogeneity fields agree.
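The None-to-0 coercion described above can be sketched as below; `inference_df` and the simplified `t_stat_or_nan` stand-in for `safe_inference` are illustrative assumptions:

```python
import math

def inference_df(effective_df, is_replicate_design: bool):
    # df=None falls through to z-inference downstream; for a replicate
    # design an undefined df must instead force the NaN branch (df <= 0).
    if effective_df is None and is_replicate_design:
        return 0
    return effective_df

def t_stat_or_nan(att: float, se: float, df):
    """Simplified stand-in for safe_inference's df handling."""
    if df is not None and df <= 0:
        return math.nan          # undefined df: no valid t-inference
    return att / se              # (z-inference when df is None)

assert math.isnan(t_stat_or_nan(1.0, 0.5, inference_df(None, True)))
assert t_stat_or_nan(1.0, 0.5, inference_df(None, False)) == 2.0
```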

Full regression: 326 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e62d4cf479d3c4aede4dc595791d2b29fa7b8b1


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2

Executive Summary

Methodology

  • Severity P3. Impact: the new registry note says replicate HonestDiD inference uses df_survey = min(n_valid_across_sites) - 1, but the implementation and the detailed heterogeneity note use the capped rule min(resolved_survey.df_survey, min(n_valid) - 1) and keep rank-deficient base designs undefined. That is a documentation mismatch in the project’s methodology source of truth, even though the code path itself looks correct. See docs/methodology/REGISTRY.md#L653, docs/methodology/REGISTRY.md#L618, and diff_diff/chaisemartin_dhaultfoeuille.py#L5109. Concrete fix: update the registry note to the full rule, including the resolved_survey.df_survey cap and the undefined-base-df case.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining items above are not currently tracked in TODO.md.

Security

No findings.

Documentation/Tests

I could not run the tests here because the sandbox is missing numpy and pandas.

Two non-blocker items from PR #311 AI review R3:

- P2: `test_placebo_under_replicate` queried `placebo_event_study[1]`
  but placebo horizons are stored under NEGATIVE keys (REGISTRY.md
  line 546 — `-l` corresponds to backward outcome difference
  `Y_{g, F_g-1-l} - Y_{g, F_g-1}`; storage at
  `chaisemartin_dhaultfoeuille.py:2402` uses `-lag_l` as the key).
  The test's assertion silently skipped (`.get(1)` returned None)
  so the replicate placebo SE path was effectively unverified.
  Fixed to use key `-1` and added explicit assertions on `se`,
  `p_value`, and `conf_int`.

- P3: REGISTRY line 653 described the replicate-effective df as
  `min(n_valid_across_sites) - 1`, but the actual rule (and the
  detailed heterogeneity note at line 618) is the capped form
  `min(resolved_survey.df_survey, min(n_valid) - 1)`. Updated the
  wording to include the design-df cap and the undefined-base-df
  NaN case — matching the code and line 618.
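The negative-key pitfall in the first item can be sketched with a toy dict (values invented). `.get(1)` returns None instead of raising, so a conditional assertion silently skips:

```python
# Placebo horizons are stored under NEGATIVE keys: -l is the backward
# outcome difference Y_{g, F_g-1-l} - Y_{g, F_g-1}.
placebo_event_study = {-1: {"att": 0.02, "se": 0.05},
                       -2: {"att": -0.01, "se": 0.06}}

assert placebo_event_study.get(1) is None     # the old test's silent skip
assert placebo_event_study[-1]["se"] == 0.05  # correct negative-key lookup
```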

Full regression: 316 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b48ee534e3a0934d8767fa9f8c22eb24788ccbd6


Overall Assessment

✅ Looks good

Highest unmitigated severity: P2

Executive Summary

Methodology

  • Severity P3. Impact: the dCDH methodology registry is internally contradictory. The overview in docs/methodology/REGISTRY.md#L466 still says replicate weights and PSU-level bootstrap are deferred, but the supported-contract checklist and notes in docs/methodology/REGISTRY.md#L651 through docs/methodology/REGISTRY.md#L653 now say they are implemented. Because REGISTRY.md is the methodology source of truth, this creates avoidable confusion about whether the PR is a documented extension or an undocumented deviation. Concrete fix: update the overview sentence at line 466 to match the new contract, and keep only the still-deferred single-period placebo-SE limitation called out separately.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not find these items tracked in TODO.md under “Tech Debt from Code Reviews.”

Security

  • No findings.

Documentation/Tests

Addresses PR #311 AI review R4 (1 × P2 + 2 × P3).

P2: The cross-surface df consistency test used a duplicated-replicate
design where `resolved.df_survey` binds as the cap BEFORE
heterogeneity contributes. That scenario can't detect a broken
late-propagation path. Added a monkeypatch-based regression
(`test_heterogeneity_late_nvalid_propagates_to_all_surfaces`) that:

1. First counts the main-surface calls to
   `compute_replicate_if_variance` (no heterogeneity).
2. Then runs fit with `heterogeneity=` active and a patched helper
   that returns a SMALLER n_valid ONLY on the heterogeneity call
   (call index > main_count), leaving main surfaces at full n_valid.
3. Asserts every public surface (`overall_p_value`,
   `event_study_effects`, `placebo_event_study`,
   `heterogeneity_effects`, `survey_metadata.df_survey`) uses the
   reduced df via t(reduced_n_valid - 1) distribution.

The test caught two real bugs that the duplicated-column test
missed:

- `effective_overall_t/p/ci` are set by copying from the raw
  `overall_t/p/ci` at line ~2331 (before heterogeneity runs) and
  are used as the results-constructor inputs at line ~2776.
  The R2 recompute block only reassigned `overall_*` — so
  `effective_overall_*` (what actually shipped) stayed on the
  pre-heterogeneity df. Fix: recompute `effective_overall_*` from
  `effective_overall_att/se` (covering both the normal and
  cost-benefit delta paths), and keep `overall_*` in sync for
  downstream consumers.
- `placebo_event_study_dict[-lag]` holds VALUE copies of
  `placebo_horizon_inference[lag]`, so mutating the inner dict in
  the recompute block does NOT propagate to the public surface.
  Fix: explicitly update `placebo_event_study_dict[-lag]` alongside
  the mutation in the recompute loop.
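The second bug is the classic value-copy pitfall; a toy sketch (dict names mirror the commit message, values invented) shows why mutating the inner dict alone never reaches the public surface:

```python
placebo_horizon_inference = {1: {"p_value": 0.04}}
# Public surface holds a VALUE copy, not a reference:
placebo_event_study_dict = {-1: dict(placebo_horizon_inference[1])}

placebo_horizon_inference[1]["p_value"] = 0.10          # recompute mutates inner dict
assert placebo_event_study_dict[-1]["p_value"] == 0.04  # public surface is stale

placebo_event_study_dict[-1].update(placebo_horizon_inference[1])  # explicit fix
assert placebo_event_study_dict[-1]["p_value"] == 0.10
```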

P3a: REGISTRY overview line 466 still said "Replicate weights and
PSU-level bootstrap are deferred" — contradicting the updated
checklist at lines 651-653. Rewrote to match the new contract.

P3b: `diff_diff/guides/llms-full.txt:325` still said
"Replicate weights and PSU-level bootstrap deferred". Updated to
reflect replicate support + opt-in PSU-level Hall-Mammen bootstrap
+ the mutual-exclusion contract with `n_bootstrap > 0`.

Full regression: 327 passing.
@igerber
Owner Author

igerber commented Apr 18, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bcc14c41e18ac4a09ce36f87e044857c7e829250


Overall Assessment

⛔ Blocker

Highest unmitigated severity: P0

Executive Summary

  • Prior re-review doc drift on dCDH survey support appears resolved: the methodology registry now documents replicate-weight variance and PSU-level Hall-Mammen bootstrap as supported library extensions in docs/methodology/REGISTRY.md:L466 and docs/methodology/REGISTRY.md:L651.
  • I do not see an undocumented methodology deviation in the new replicate-weight variance or PSU-level bootstrap mechanics themselves; the main issue is an inference-propagation bug on a public results surface.
  • Severity P0: the new late-df_survey recompute path updates overall/event-study/placebo/heterogeneity outputs, but it does not update results.normalized_effects. When heterogeneity= contributes the binding replicate n_valid, normalized-effects inference can disagree with the final survey_metadata.df_survey and silently report wrong p_value/conf_int (and potentially miss the all-NaN gate when the final replicate df becomes undefined).
  • The new regression test that claims to cover “every public surface” omits normalized_effects, which is why this bug can pass the new suite.
  • I could not execute pytest here because pytest is not installed in this environment.

Methodology

Affected methods checked: dCDH survey IF expansion / replicate-weight analytical variance, heterogeneity under replicate weights, and PSU-level Hall-Mammen wild bootstrap. The new extensions are documented in docs/methodology/REGISTRY.md:L466, docs/methodology/REGISTRY.md:L618, and docs/methodology/REGISTRY.md:L651. I did not find an undocumented paper/registry mismatch in those mechanics.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The blocker above is not mitigated by tracking; I did not find a matching entry under TODO.md:L51.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new late-n_valid regression explicitly says it covers “every public surface” but only asserts overall, event-study, placebo, heterogeneity, and survey_metadata.df_survey in tests/test_survey_dcdh_replicate_psu.py:L922 and tests/test_survey_dcdh_replicate_psu.py:L936. normalized_effects is also a public surface, including export via diff_diff/chaisemartin_dhaultfoeuille_results.py:L1216. That gap is why the P0 above can slip through. Concrete fix: extend test_heterogeneity_late_nvalid_propagates_to_all_surfaces() to assert res.normalized_effects[h]["p_value"] and conf_int use the final reduced df; add a second case with reduced_n_valid=1 to verify the all-NaN branch.
  • I could not run the test suite here because pytest is unavailable.

Path to Approval

  1. Update the final replicate-df propagation block in diff_diff/chaisemartin_dhaultfoeuille.py:L2694 so normalized_effects_out is recomputed with _final_inf_df, or defer building normalized-effects inference until after _final_eff_df is known.
  2. Extend tests/test_survey_dcdh_replicate_psu.py:L922 to cover results.normalized_effects, including a case where late heterogeneity reduces the final df and a case where it makes the final replicate df undefined (n_valid < 2).

…branch

Addresses PR #311 AI review R5 P0 + P2.

P0: The R2 post-heterogeneity finalization block updated overall,
joiners, leavers, multi-horizon event-study, placebo-horizon, and
heterogeneity surfaces, but missed `normalized_effects_out` — which
is built at lines 2472-2487 from a delta-method SE (`SE(DID_l) /
delta^D_l`) using the pre-heterogeneity `_df_survey`. Under a
late-arriving replicate `n_valid` from heterogeneity, the public
`results.normalized_effects` dict shipped stale p-values/CIs and
could miss the all-NaN gate when the final replicate df turned
undefined.

Fix: extend the finalization block to also recompute every
`normalized_effects_out[h]` entry with `_final_inf_df`.

P2: Extended `test_heterogeneity_late_nvalid_propagates_to_all_surfaces`
to assert `res.normalized_effects` p-values match the `t(final_df)`
distribution. Added a new case
`test_late_nvalid_below_two_forces_all_surfaces_nan` that drives
the final `n_valid` to 1 (reduced df = 0 → None via `_inference_df`
coercion) and verifies EVERY public surface (overall,
event_study_effects, placebo_event_study, heterogeneity_effects,
normalized_effects) flips to NaN — not silently keeps the finite
values computed with the larger intermediate df.

Full regression: 328 passing.
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: abdfea4630e86fd7079870ee3602696871d2068f


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3

Executive Summary

Methodology

Code Quality

  • Severity P3. Impact: the new warning gate uses all positive-weight raw groups, while the actual PSU bootstrap map is built from post-filter eligible_group_ids; if upstream dCDH filtering drops groups, the code can still emit a warning even when the effective bootstrap target is one group per PSU. Concrete fix: compute the warning counts from eligible_group_ids_bootstrap / group_id_to_psu_code_bootstrap (or from the same post-filter mask used to build them), and add a regression with dropped groups sharing PSUs. See diff_diff/chaisemartin_dhaultfoeuille.py:L1973-L1990 and diff_diff/chaisemartin_dhaultfoeuille.py:L2145-L2165.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. I did not find a matching tracked limitation under the “Tech Debt from Code Reviews” section of TODO.md:L51.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: two in-code docstrings still say dCDH’s bootstrap is clustered at the group level only, which no longer matches the implemented survey PSU path and can mislead future readers about the supported contract. Concrete fix: update both docstrings to say “group-level by default; PSU-level under survey_design when PSUs are coarser than groups,” and point readers to the registry note. See diff_diff/chaisemartin_dhaultfoeuille.py:L302-L310 and diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L5-L10.
  • No additional test-gap findings in the re-review scope. The new regression coverage now exercises the previous normalized_effects gap and the final-NaN path in tests/test_survey_dcdh_replicate_psu.py:L996-L1166. I could not execute the suite in this environment.

Addresses PR #311 AI review R6 (2 × P3 cleanups).

P3 #1: Warning gate was computed from raw positive-weight groups,
not the post-filter eligible-group set used to build the bootstrap
PSU map. Panels where upstream dCDH filtering drops groups that
share PSUs with kept groups could emit a misleading "PSU coarser
than group" warning even when the effective bootstrap is one group
per PSU.

Fix: count PSUs and groups from `_eligible_group_ids` (the same set
feeding `group_id_to_psu_code_bootstrap`), preserving the within-
group-constant-PSU invariant by taking each eligible group's first
positive-weight PSU label.
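The corrected gate can be sketched as follows: count PSUs and groups from the post-filter eligible set rather than from all raw positive-weight rows. All names here are illustrative, not the library's internals:

```python
# Sketch of the corrected warning gate: the counts come from the same
# post-filter eligible set that feeds the bootstrap PSU map, so dropped
# groups that share PSUs with kept groups no longer trigger a warning.
def psu_warning_needed(eligible_group_ids, group_to_psu):
    groups = set(eligible_group_ids)
    # One PSU label per group (within-group-constant-PSU invariant).
    psus = {group_to_psu[g] for g in groups}
    # Warn only when PSUs are strictly coarser than groups.
    return len(psus) < len(groups)

group_to_psu = {"g1": "p1", "g2": "p1", "g3": "p2"}
# Raw panel: g1 and g2 share PSU p1 -> coarser, warning fires.
print(psu_warning_needed(["g1", "g2", "g3"], group_to_psu))  # True
# Post-filter: g2 dropped -> one group per PSU -> no warning.
print(psu_warning_needed(["g1", "g3"], group_to_psu))        # False
```

The second call is the regression scenario described above: dropped groups sharing PSUs with kept groups must not produce a misleading "PSU coarser than group" warning.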

P3 #2: Two docstrings said the bootstrap is "clustered at the group
level" only — now incomplete after the PSU-level survey path:
- `diff_diff/chaisemartin_dhaultfoeuille.py` class docstring:
  extended to note PSU-level Hall-Mammen wild clustering under
  `survey_design` with coarser PSU.
- `diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py` module
  docstring: documents the identity-map fast path (auto-inject
  `psu=group`), the PSU-level broadcast when PSU is strictly
  coarser, and points to REGISTRY.md for the full contract.

Full regression: 318 passing.
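The PSU-level broadcast documented in the updated docstrings can be sketched as: draw one Hall-Mammen (Mammen 1993) weight per PSU and broadcast it to every group inside that PSU, so groups sharing a PSU perturb together. This is a minimal sketch, not the library's implementation:

```python
import numpy as np

def mammen_weights(n_psu, rng):
    # Mammen (1993) two-point distribution: mean 0, variance 1,
    # third moment 1 (the Hall-Mammen wild bootstrap weights).
    sqrt5 = np.sqrt(5.0)
    lo, hi = -(sqrt5 - 1) / 2, (sqrt5 + 1) / 2
    p_lo = (sqrt5 + 1) / (2 * sqrt5)
    return np.where(rng.random(n_psu) < p_lo, lo, hi)

def psu_broadcast_weights(group_psu_codes, rng):
    # One weight per unique PSU, broadcast back to groups via the
    # inverse index, clustering the perturbation at the PSU level.
    codes, inverse = np.unique(group_psu_codes, return_inverse=True)
    return mammen_weights(len(codes), rng)[inverse]

rng = np.random.default_rng(0)
w = psu_broadcast_weights(np.array([0, 0, 1, 2, 2]), rng)
# Groups 0,1 share PSU 0; groups 3,4 share PSU 2: identical weights.
assert w[0] == w[1] and w[3] == w[4]
```

When PSU equals group (the auto-inject default), the inverse index is the identity, which is exactly the fast path that preserves pre-change behavior.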
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: 41474069e87331a1019a4177666ca4cb52b80a9f


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new untracked findings. I did not see a silent-correctness issue being deferred, and I did not find a matching blocker under TODO.md:51.

Security

  • No findings.

Documentation/Tests

Addresses PR #311 AI review R7 (2 × P3 doc drift cleanups).

R7 P3 #1: Several sites still said dCDH "always clusters at the
group level" — which was true when the PR was written but is now
incomplete given the PSU-level Hall-Mammen wild bootstrap path
under `survey_design`. Updated to distinguish user-specified
`cluster=` (still unsupported, raises NotImplementedError) from
automatic PSU-level clustering (takes over under `survey_design`
with strictly-coarser PSUs; identity under auto-inject `psu=group`):
- `docs/methodology/REGISTRY.md:592` Note (cluster contract) —
  rewrote to describe both paths; dropped "Phase 1" framing.
- `docs/methodology/REGISTRY.md:636` checklist — added the
  automatic PSU-level upgrade clause.
- `diff_diff/chaisemartin_dhaultfoeuille.py:321` constructor
  docstring — same contract split.
- `diff_diff/chaisemartin_dhaultfoeuille.py:432` / `:503`
  `cluster=` error messages — removed "Phase 1" phrasing, added
  PSU-level-under-survey_design context.
- `tests/test_chaisemartin_dhaultfoeuille.py:405` regex updated
  to match the new error wording (no longer pins "Phase 1").

R7 P3 #2: `diff_diff/guides/llms-full.txt:321` said Phase 2 will
add multiplier-bootstrap support for placebo and bootstrap covers
`DID_M`, `DID_+`, `DID_-` only — both stale after this PR's
L_max >= 1 placebo and event-study bootstrap paths. Rewrote to
scope the NaN-SE contract to `L_max=None` only and describe the
full bootstrap coverage (overall, joiners, leavers, per-horizon
event-study, placebo horizons, shared weights for sup-t bands).

Full regression: 336 passing.
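The shared-weights sup-t bands mentioned above follow a standard construction: reuse the same bootstrap draws across horizons, take the bootstrap distribution of the max absolute t-statistic across horizons, and invert its quantile. A minimal sketch under those assumptions:

```python
import numpy as np

def supt_band(est, se, boot_est, alpha=0.05):
    # est, se: (H,) per-horizon point estimates and standard errors.
    # boot_est: (B, H) bootstrap estimates from SHARED weight draws,
    # so cross-horizon dependence survives into the max statistic.
    t_star = np.abs(boot_est - est) / se          # (B, H)
    crit = np.quantile(t_star.max(axis=1), 1 - alpha)
    return est - crit * se, est + crit * se

rng = np.random.default_rng(1)
est = np.array([0.5, 0.8, 1.1])
se = np.array([0.2, 0.2, 0.3])
# Toy bootstrap draws standing in for shared-weight replicates.
boot = est + rng.standard_normal((999, 3)) * se
lo, hi = supt_band(est, se, boot)
```

Because the critical value is taken from the max over horizons, the band covers all horizons simultaneously at the nominal level, unlike pointwise intervals built from each horizon's own quantile.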
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: 9368ae65b89c347e469c9d711607e50acf434a5a


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3

Executive Summary

Methodology

No findings. Affected methods are the dCDH survey IF/replicate variance path on the main ATT, event-study, and placebo surfaces diff_diff/chaisemartin_dhaultfoeuille.py:L1718-L1768, diff_diff/chaisemartin_dhaultfoeuille.py:L1836-L1909, the survey heterogeneity path diff_diff/chaisemartin_dhaultfoeuille.py:L3540-L3818, PSU-level Hall-Mammen bootstrap diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py:L68-L399, and survey-backed twowayfeweights() aggregation diff_diff/chaisemartin_dhaultfoeuille.py:L5604-L5694. These are explicitly documented as library extensions in docs/methodology/REGISTRY.md:L618-L618 and docs/methodology/REGISTRY.md:L651-L653, and the code routes through the shared survey helpers consistently.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. I did not find a silent-correctness issue being deferred via TODO.md:L51-L83.

Security

No findings.

Documentation/Tests

Absorbs #312 (sdid scale fix), #313 (roadmap refresh + dCDH docstring
rewrite), and #314 (within_transform convergence warnings) from main.

Conflicts resolved in:
- diff_diff/chaisemartin_dhaultfoeuille.py: took main's comprehensive
  Phase 1-3 feature list in the class docstring but merged in the
  PR #311 group-vs-PSU bootstrap-clustering framing and the
  replicate-weight survey-support line. Kept the PR #311
  'user-specified cluster= not supported + automatic PSU-level under
  survey_design' wording for the cluster= parameter docstring (strictly
  more accurate than main's 'always clusters at the group level' text).
- diff_diff/guides/llms-full.txt: kept main's more detailed placebo SE
  contract paragraph (which already distinguishes single-period NaN
  from multi-horizon analytical/bootstrap) and appended the sup-t /
  shared-weights / cross-horizon coverage details from the PR #311
  update. Kept the PR #311 survey_design signature comment that
  mentions TSL + replicate + PSU bootstrap.

Full regression across touched areas: 336 + 324 passing.
@igerber commented Apr 18, 2026

/ai-review

@github-actions
🔁 AI review rerun (requested by @igerber)

Head SHA: 0e537b0caefb85db7ff525ebf0d01e11ab70546e


Overall Assessment

✅ Looks good

Executive Summary

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 18, 2026
@igerber igerber merged commit 76fb12c into main Apr 18, 2026
19 checks passed
@igerber igerber deleted the feature/dcdh-survey-variance-extensions branch April 18, 2026 19:10
igerber added a commit that referenced this pull request Apr 18, 2026
…odo config

Packages five merged PRs since v3.1.2 as patch release 3.1.3:

- #311 Replicate-weight variance and PSU-level bootstrap for dCDH — new
  variance_method="replicate" (BRR / Fay / JK1 / JKn / SDR) and PSU-level
  multiplier bootstrap, with df-aware inference and group-level PSU map.
- #321 Zenodo DOI auto-minting config — .zenodo.json + top-level LICENSE
  so the next GitHub Release mints a concept + versioned DOI automatically.
- #319 Silent sparse->dense lstsq fallback signaling in ImputationDiD and
  TwoStageDiD — emits ConvergenceWarning instead of switching paths silently.
- #317 Non-convergence signaling in TROP alternating-minimization solvers,
  including LOOCV and bootstrap aggregation. Top-level warning aggregation.
- #320 /bump-version skill now updates CITATION.cff; single RELEASE_DATE
  resolved upfront and threaded through all date-bearing files.

Version strings bumped in diff_diff/__init__.py, pyproject.toml,
rust/Cargo.toml, diff_diff/guides/llms-full.txt, and CITATION.cff
(version: 3.1.3, date-released: 2026-04-18). CHANGELOG populated with
Added / Fixed / Changed sections and comparison-link footer.

Per project SemVer convention, minor bumps are reserved for new estimators
or new module-level API; additive extensions to existing estimators (like
PR #311's new variance_method values) are patch-level.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>